Language Through a Prism: A Spectral Approach for Multiscale Language Representations
Language exhibits structure at a wide range of scales, from subwords to words, sentences, paragraphs, and documents. We propose building models that isolate scale-specific information in deep representations, and develop methods for encouraging models during training to learn more about particular scales of interest. Our method for creating scale-specific neurons in deep NLP models constrains how the activation of a neuron can change across the tokens of an input by interpreting those activations as a digital signal and filtering out parts of its frequency spectrum. This technique enables us to extract scale-specific information from BERT representations: by filtering out different frequencies we can produce new representations that perform well on part-of-speech tagging (word-level), dialog speech act classification (utterance-level), or topic classification (document-level), while performing poorly on the other tasks. We also present a prism layer for use during training, which constrains different neurons of a BERT model to different parts of the frequency spectrum. Our proposed BERT + Prism model is better able to predict masked tokens using long-range context, and produces individual multiscale representations that achieve comparable or improved performance across all three tasks. Our methods are general and readily applicable to other domains besides language, such as images, audio, and video.
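To make the filtering idea concrete, here is a minimal numpy sketch, not the authors' implementation: each neuron's activations across token positions are treated as a signal, transformed with a real FFT, masked to a frequency band, and inverted. The names bandpass_filter and prism_layer are hypothetical, and details such as the exact transform, band edges, and where the prism layer sits inside BERT are assumptions that may differ from the paper.

```python
import numpy as np

def bandpass_filter(activations, low, high):
    """Zero out frequency components outside [low, high) of each neuron's
    activation signal across token positions.

    activations: (seq_len, hidden_dim) array, one signal per neuron.
    low, high:   band edges in cycles per token, within [0, 0.5].
    """
    seq_len = activations.shape[0]
    spectrum = np.fft.rfft(activations, axis=0)      # one spectrum per neuron
    freqs = np.fft.rfftfreq(seq_len)                 # frequencies in [0, 0.5]
    spectrum[(freqs < low) | (freqs >= high)] = 0.0  # drop out-of-band bins
    return np.fft.irfft(spectrum, n=seq_len, axis=0)

def prism_layer(activations, n_bands=4):
    """Constrain contiguous groups of neurons to disjoint frequency bands:
    low-frequency groups vary slowly across tokens (document scale),
    high-frequency groups vary quickly (word scale)."""
    seq_len, hidden = activations.shape
    edges = np.linspace(0.0, 0.5 + 1e-9, n_bands + 1)  # split [0, Nyquist]
    out = np.empty_like(activations)
    for i in range(n_bands):
        cols = slice(i * hidden // n_bands, (i + 1) * hidden // n_bands)
        out[:, cols] = bandpass_filter(activations[:, cols],
                                       edges[i], edges[i + 1])
    return out

# Stand-in for one layer of BERT token activations.
acts = np.random.randn(128, 768)
doc_scale = bandpass_filter(acts, 0.0, 0.02)  # keep only slow variation
prismed = prism_layer(acts)                   # one frequency band per neuron group
```

Under these assumptions, low-pass-filtered representations would retain document-scale information (e.g., for topic classification) while discarding word-scale detail, and the prism layer would yield a single representation whose neuron groups separately carry word-, utterance-, and document-scale information.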
Review for NeurIPS paper: Language Through a Prism: A Spectral Approach for Multiscale Language Representations
Weaknesses: The biggest limitation of this work, for me, is the experimental setup, specifically (1) the lack of comparison to existing models, (2) poor results on text classification and speech act classification compared to existing work, and (3) the choice of benchmarks. I would recommend reporting results presented in previous work on POS tagging, speech act classification, and text classification. This is particularly important since you run your own BERT baselines; it would be useful for the reader to know how these baselines compare with numbers reported in other papers. For example, [1] reports results on 20Newsgroups, [2, 3] on the Switchboard dialog act classification dataset, and [4, 5] on POS tagging. In particular, [1] reports 86.8% accuracy on 20Newsgroups, while you report only 32.21% for BERT and 51.01% for BERT + Prism.